source code model
INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers
Karmakar, Anjan, Robbes, Romain
Pre-trained models of source code have recently been successfully applied to a wide variety of Software Engineering tasks; they have also seen some practical adoption, e.g. for code completion. Yet, we still know very little about what these pre-trained models learn about source code. In this article, we use probing (simple diagnostic tasks that do not further train the models) to discover to what extent pre-trained models learn about specific aspects of source code. We use an extensible framework to define 15 probing tasks that exercise surface, syntactic, structural and semantic characteristics of source code. We probe 8 pre-trained source code models, as well as a natural language model (BERT) as our baseline. We find that models that incorporate some structural information (such as GraphCodeBERT) have a better representation of source code characteristics. Surprisingly, we find that for some probing tasks, BERT is competitive with the source code models, indicating that there are ample opportunities to improve source-code-specific pre-training on the respective code characteristics. We encourage other researchers to evaluate their models with our probing task suite, so that they may peer into the hidden layers of the models and identify what intrinsic code characteristics are encoded.
Source Code Data Augmentation for Deep Learning: A Survey
Zhuo, Terry Yue, Yang, Zhou, Sun, Zhensu, Wang, Yufei, Li, Li, Du, Xiaoning, Xing, Zhenchang, Lo, David
The increasingly popular adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and generalizability) of these models. Although a series of DA methods have been proposed and tailored for source code models, the field lacks a comprehensive survey and examination to understand their effectiveness and implications. This paper fills this gap by conducting a comprehensive and integrative survey of data augmentation for source code, wherein we systematically compile and encapsulate existing literature to provide a comprehensive overview of the field. We start with an introduction of data augmentation in source code and then provide a discussion on major representative approaches. Next, we highlight the general strategies and techniques to optimize the DA quality. Subsequently, we underscore techniques useful in real-world source code scenarios and downstream tasks. Finally, we outline the prevailing challenges and potential opportunities for future research. In essence, we aim to demystify the corpus of existing literature on source code DA for deep learning, and foster further exploration in this sphere. Complementing this, we present a continually updated GitHub repository that hosts a list of up-to-date papers on DA for source code modeling, accessible at https://github.com/terryyz/DataAug4Code.
Interpreting Pretrained Source-code Models using Neuron Redundancy Analyses
Sharma, Arushi, Hu, Zefu, Quinn, Christopher, Jannesari, Ali
Neural code intelligence models continue to be 'black boxes' to the human programmer. This opacity limits their application towards code intelligence tasks, particularly for applications like vulnerability detection where a model's reliance on spurious correlations can be safety-critical. We introduce a neuron-level approach to interpretability of neural code intelligence models which eliminates redundancy due to highly similar or task-irrelevant neurons within these networks. We evaluate the remaining important neurons using probing classifiers which are often used to ascertain whether certain properties have been encoded within the latent representations of neural intelligence models. However, probing accuracies may be artificially inflated due to the repetitive and deterministic nature of tokens in code datasets. Therefore, we adapt the selectivity metric originally introduced in NLP to account for probe memorization, to formulate our source-code probing tasks. Through our neuron analysis, we find that more than 95% of the neurons are redundant with respect to our code intelligence tasks and can be eliminated without significant loss in accuracy. We further trace individual neurons and subsets of important neurons to specific code properties which could be used to influence model predictions. We demonstrate that it is possible to identify 'number' neurons, 'string' neurons, and higher level 'text' neurons which are responsible for specific code properties. This could potentially be used to modify neurons responsible for predictions based on incorrect signals. Additionally, the distribution and concentration of the important neurons within different source code embeddings can be used as measures of task complexity, to compare source-code embeddings and guide training choices for transfer learning over similar tasks.
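The selectivity idea mentioned in this abstract can be illustrated with a small sketch. It assumes the common NLP formulation in which selectivity is the probe's accuracy on the real task minus its accuracy on a control task with randomly assigned labels; the function names and toy labels below are hypothetical and are not taken from the paper.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the gold label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def selectivity(task_preds, task_labels, control_preds, control_labels):
    """Probe accuracy on the real task minus accuracy on the control task.

    Assumption: the control task reuses the same inputs with labels that
    carry no real property, so high control accuracy signals memorization
    by the probe rather than information in the representation.
    """
    return accuracy(task_preds, task_labels) - accuracy(control_preds, control_labels)


# Toy example with made-up token-type labels.
score = selectivity(
    task_preds=["num", "str", "num", "str"],
    task_labels=["num", "str", "num", "num"],
    control_preds=["a", "b", "a", "b"],
    control_labels=["a", "a", "b", "b"],
)
print(score)
```

A probe that scores well on both tasks has low selectivity, which suggests its accuracy reflects memorized token identities more than encoded code properties.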
Codex Hacks HackerRank: Memorization Issues and a Framework for Code Synthesis Evaluation
Karmakar, Anjan, Prenner, Julian Aron, D'Ambros, Marco, Robbes, Romain
The Codex model has demonstrated extraordinary competence in synthesizing code from natural language problem descriptions. However, in order to reveal unknown failure modes and hidden biases, such large-scale models must be systematically subjected to multiple and diverse evaluation studies. In this work, we evaluate the code synthesis capabilities of the Codex model based on a set of 115 Python problem statements from a popular competitive programming portal: HackerRank. Our evaluation shows that Codex is indeed proficient in Python, solving 96% of the problems in a zero-shot setting, and 100% of the problems in a few-shot setting. However, Codex exhibits clear signs of generating memorized code based on our evaluation. This is alarming, especially since the adoption and use of such models could directly impact how code is written and produced in the foreseeable future. With this in mind, we further discuss and highlight some of the prominent risks associated with large-scale models of source code. Finally, we propose a framework for code-synthesis evaluation using variations of problem statements based on mutations.
Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora
Dau, Anh T. V., Nguyen-Duc, Thang, Thanh-Tung, Hoang, Bui, Nghi D. Q.
Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. We adapt data-influence methods to detect such noise in this paper. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples from neural code models in classification-based tasks. This approach will contribute to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.
Energy-bounded Learning for Robust Models of Code
In programming, learning code representations has a variety of applications, including code classification, code search, comment generation, bug prediction, and so on. Various representations of code in terms of tokens, syntax trees, dependency graphs, code navigation paths, or a combination of their variants have been proposed. However, existing vanilla learning techniques have a major limitation in robustness, i.e., it is easy for the models to make incorrect predictions when the inputs are altered in a subtle way. To enhance the robustness, existing approaches focus on recognizing adversarial samples rather than on the valid samples that fall outside a given distribution, which we refer to as out-of-distribution (OOD) samples. Recognizing such OOD samples is the novel problem investigated in this paper. To this end, we propose to first augment the in-distribution datasets with out-of-distribution samples such that, when trained together, they will enhance the model's robustness. We propose the use of an energy-bounded learning objective function to assign a higher score to in-distribution samples and a lower score to out-of-distribution samples in order to incorporate such out-of-distribution samples into the training process of source code models. In terms of OOD detection and adversarial sample detection, our evaluation results show that existing source code models become more robust: they are more accurate at recognizing OOD data while being more resistant to adversarial attacks. Furthermore, the proposed energy-bounded score outperforms all existing OOD detection scores by a large margin, including the softmax confidence score, the Mahalanobis score, and ODIN.
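As a rough illustration of an energy-style score that is higher for in-distribution inputs, the sketch below computes the negative free energy of a classifier's logits, T * log(sum(exp(logit / T))), alongside the softmax confidence baseline the abstract mentions. This is the standard energy-based OOD scoring formulation and only an assumption about what the paper builds on; it covers scoring only, not the energy-bounded training objective. The function names, the temperature default, and the threshold are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def negative_energy_score(logits, temperature=1.0):
    """Negative free energy T * logsumexp(logits / T).

    Larger values mean the model assigns more total unnormalized mass to
    the input, which is read here as "more in-distribution".
    """
    scaled = [z / temperature for z in logits]
    return temperature * log_sum_exp(scaled)


def max_softmax_probability(logits):
    """Softmax confidence baseline: probability of the top class."""
    return math.exp(max(logits) - log_sum_exp(logits))


def is_in_distribution(logits, threshold, temperature=1.0):
    """Flag an input as in-distribution when its score clears a threshold.

    The threshold is a hypothetical tuning parameter, typically chosen on
    held-out in-distribution data.
    """
    return negative_energy_score(logits, temperature) >= threshold


confident_logits = [10.0, 0.0, 0.0]   # made-up logits for a familiar input
flat_logits = [0.1, 0.0, -0.1]        # made-up logits for an unfamiliar input
print(negative_energy_score(confident_logits), negative_energy_score(flat_logits))
```

Unlike the softmax confidence, which is normalized and always lies in (0, 1], the energy score keeps the overall magnitude of the logits, which is one reason it can separate the two toy inputs more widely.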
On-the-Fly Adaptation of Source Code Models using Meta-Learning
Shrivastava, Disha, Larochelle, Hugo, Tarlow, Daniel
The ability to adapt to unseen, local contexts is an important challenge that successful models of source code must overcome. One of the most popular approaches for the adaptation of such models is dynamic evaluation. With dynamic evaluation, when running a model on an unseen file, the model is updated immediately after having observed each token in that file. In this work, we propose instead to frame the problem of context adaptation as a meta-learning problem. We aim to train a base source code model that is best able to learn from information in a file to deliver improved predictions of missing tokens. Unlike dynamic evaluation, this formulation allows us to select more targeted information (support tokens) for adaptation, that is both before and after a target hole in a file. We consider an evaluation setting that we call line-level maintenance, designed to reflect the downstream task of code auto-completion in an IDE. Leveraging recent developments in meta-learning such as first-order MAML and Reptile, we demonstrate improved performance in experiments on a large-scale Java GitHub corpus, compared to other adaptation baselines including dynamic evaluation. Moreover, our analysis shows that, compared to a non-adaptive baseline, our approach improves performance on identifiers and literals by 44% and 15%, respectively. Our implementation can be found at: https://github.com/shrivastavadisha/meta_learn_source_code
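The Reptile update named in this abstract can be sketched on a toy problem. The example below is not the authors' setup: each "task" is a hypothetical scalar target standing in for a file, the loss is (theta - target)^2, and the inner loop plays the role of adapting on support tokens. All names and hyperparameters are illustrative assumptions.

```python
def sgd_adapt(theta, target, inner_lr, steps):
    """Adapt a copy of theta with a few gradient steps on (theta - target)^2."""
    phi = theta
    for _ in range(steps):
        grad = 2.0 * (phi - target)
        phi -= inner_lr * grad
    return phi


def reptile(task_targets, theta=0.0, inner_lr=0.1, meta_lr=0.5,
            inner_steps=5, epochs=200):
    """Reptile meta-training: move theta toward each task's adapted weights.

    For every task, run the inner loop to get phi, then apply the
    first-order meta-update theta <- theta + meta_lr * (phi - theta).
    """
    for _ in range(epochs):
        for target in task_targets:
            phi = sgd_adapt(theta, target, inner_lr, inner_steps)
            theta += meta_lr * (phi - theta)
    return theta


# Two toy tasks whose optima are 1.0 and 3.0.
meta_theta = reptile([1.0, 3.0])
print(meta_theta)
```

The learned initialization settles between the two task optima, so a few inner steps reach either one faster than starting from zero; no second-order gradients are needed, which is what makes Reptile a first-order method.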